Abstract. In evolutionary biology, genetic sequences carry with them a trace
of the underlying tree that describes their evolution from a common ancestral 
sequence. The question of how many sequence sites are required to recover 
this evolutionary relationship accurately depends on the model of sequence 
evolution, the substitution rate, divergence times and the method used to 
infer phylogenetic history. A particularly challenging problem for phylogenetic 
methods arises when a rapid divergence event occurred in the distant past. 
We analyse an idealised form of this problem in which the terminal edges 
of a symmetric four-taxon tree are some factor ($p$) times the length of the
interior edge. We determine an order $p^2$ lower bound on the growth rate for
the sequence length required to resolve the tree (independent of any particular
branch length). We also show that this rate of sequence length growth can
be achieved by existing methods (including the simple 'maximum parsimony'
method), and compare these order $p^2$ bounds with an order $p$ growth rate for a
model that describes low-homoplasy evolution. In the final section, we provide 
a generic bound on the sequence length requirement for a more general class 
of Markov processes. 
1. Introduction

When sequence sites evolve independently under a Markov process along the
branches of a tree $T$, the sequences observed at the tips contain information
concerning the underlying tree. This allows for the tree $T$ to be reconstructed
accurately from sufficiently long sequences; this is the basis of modern molecular
systematics [3]. The number of sites required to reconstruct $T$ accurately depends
on how long the edges of the tree are. More precisely, it depends on the expected
number of substitutions on each branch (edge) $e$ of the tree, which we refer to as
the branch length of $e$ (this is the product of the temporal duration of the branch
and the substitution rate).

A number of authors (e.g. [21 [El [HI EH \17\ [18] ) have considered various ways 
to quantify the phylogenetic signal in aligned DNA sequences, and to estimate the 
sequence length required to reconstruct a phylogenetic tree. Most of these studies 
have involved simulation or heuristic approaches, although some analytical bounds 
have also been obtained [H [14] . Typically, these bounds state that if an interior 
branch length is very short, or if a terminal (external) branch length is long, then 
a large number of sites will be required. 

In this paper we explore these results further by obtaining bounds that are
expressed purely in terms of the relative sizes of the branch lengths, not their absolute
values. One motivation for our approach is that different genes are known to evolve 
at different rates, so that any particular branch length will depend on which gene 
is considered; however, the ratios of the branch lengths will be unchanged if the 
gene-specific rate applies uniformly across the tree. 

A particularly difficult tree reconstruction problem, requiring long sequences to 
resolve, arises when one has an interior edge with a short branch length incident 
with edges (or subtrees) having large branch lengths. Such a scenario occurs, for 
example, when a relatively rapid speciation event (leading to the short branch 
length for that edge) occurred in the distant past (leading to the large branch 
lengths for the incident edges). Several examples of this have been highlighted in the 
literature [61 \TU\ and include the origin of metazoa and the origin of photosynthesis. 

In this paper we analyse a scenario which, although somewhat idealised, nevertheless
captures the essence of this problem: a four-taxon tree, where the terminal
edges have equal branch lengths that are $p > 1$ times the branch length of the
interior edge, and a simple symmetric model of site evolution (specifically, we assume
sites evolve independently according to a common two-state Markov process).

We provide a mathematical analysis of the question of how many sites are required
to resolve the tree correctly (from the three possible resolved topologies on
four taxa). We are particularly interested in how the growth of the sequence length,
$k$, depends on $p$, independent of the absolute value of a particular edge length. We
establish that $k$ must grow at the rate $p^2$, which implies that regardless of how fast
(or slow) any particular sequence is evolving, we can set definite lower bounds on
the length of sequences required to resolve the tree. We then show that for our setting,
$p^2$ growth in $k$ is the best possible, as an existing method (namely, maximum
parsimony) achieves this bound. Our results complement an earlier simulation-based
analysis [15]. We contrast our results by considering a quite different model
of site evolution (the infinite state model) and establishing that order $p$ growth in
$k$ can sometimes suffice for this model.

We also extend the approach to more general Markov processes on trees, obtaining
exact, but less explicit, lower bounds on $k$ that involve absolute (rather
than relative) branch lengths. Our arguments are based on standard techniques
from probability theory, such as central limit approximation, and information-theoretic
arguments based on the properties of Hellinger distance.



Consider an unrooted binary phylogenetic tree on four taxa, say 12|34, with
branch length $x$ for the interior edge $e_5$ and $px$ for the terminal edges $e_1, \ldots, e_4$,
where $p > 1$. This is illustrated in Fig. 1(a), and the topology of the tree is shown
at the top of Fig. 1(b). The other two competing topologies (13|24 and 14|23)
are also shown in Fig. 1(b). Here branch length refers to the expected number of
substitutions under some continuous time substitution process.



Figure 1. (a) The generating tree with interior branch length $x$
and all four terminal branch lengths equal to $px$. (b) This tree has
the topology 12|34, while the other two binary topologies are 13|24
and 14|23.

Recall that a binary character or site pattern refers to an assignment to each
taxon of a state from some two-element set, which we will denote throughout this
paper as $\{\alpha, \beta\}$.

Suppose that a sequence of binary characters is generated independently and
identically (i.i.d.) under a symmetric two-state model on the tree. This model is
often called the CFN (Cavender-Farris-Neyman) model or, more briefly, the Neyman
2-state model (for more details see e.g. [3]). Although it is the simplest non-trivial
Markov process on a tree, it allows for an exact analysis. Moreover, stochastic
results for this model typically extend to more general finite-state models where an
exact analysis is usually more complex [8], and in Section 5 we show how some of
our approaches extend to more general Markov processes.

If we denote the substitution probability on edge $e_i$ by $P(e_i)$, then for each
terminal edge we have $P(e_i) = \frac{1}{2}(1 - \exp(-2px))$, while for the central edge $e_5$
we have $P(e_5) = \frac{1}{2}(1 - \exp(-2x))$. Let $\theta_i = 1 - 2P(e_i)$ for $i = 1, \ldots, 5$. Then we
can express these five $\theta_i$ values in terms of $\theta := e^{-2x}$ as follows:

$\theta_i = \theta^p$ for $i = 1, \ldots, 4$; and $\theta_5 = \theta$.
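
As a quick numerical illustration of these definitions, the following Python sketch computes the substitution probabilities and the corresponding $\theta_i$ values, taking the substitution probability on a branch of length $l$ to be $\frac{1}{2}(1 - e^{-2l})$. The function names and the sample values of $x$ and $p$ are illustrative assumptions, not part of the original analysis.

```python
import math

def cfn_substitution_probability(branch_length):
    """Probability of a net state change on a branch of the given length
    under the symmetric two-state model: (1 - exp(-2 l)) / 2."""
    return 0.5 * (1.0 - math.exp(-2.0 * branch_length))

def theta_values(x, p):
    """Return [theta_1, ..., theta_5] with theta_i = 1 - 2 P(e_i).

    Edges e_1..e_4 are terminal (length p * x); e_5 is interior (length x).
    """
    terminal = 1.0 - 2.0 * cfn_substitution_probability(p * x)
    interior = 1.0 - 2.0 * cfn_substitution_probability(x)
    return [terminal] * 4 + [interior]

x, p = 0.05, 10.0            # illustrative values only
theta = math.exp(-2.0 * x)   # theta := e^{-2x}
thetas = theta_values(x, p)
print(thetas)
```

The terminal values agree with $\theta^p$ and the interior value with $\theta$, up to floating-point rounding.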

Now, if we fix $x$ and let $p$ grow, or, alternatively, if we fix $px$ and let $x$ tend
to zero, then the sequence length $k$ required to reconstruct the topology of the
generating tree accurately tends to infinity. This holds for any tree reconstruction
method that treats all three topologies fairly (if a method has an a priori preference
for one topology, it will perform worse on an alternative topology). For example,
if $px$ is fixed, then $k$ grows at the rate $1/x^2$ as $x$ tends to zero (by Theorem 4.1 of
[14]). However, if we do not fix $x$ or $px$ in advance, two fundamental questions
arise: (i) what is the slowest rate at which $k$ can possibly grow as a function of $p$, and
(ii) does some value of $x$ (dependent on $p$) achieve this rate of growth for a certain
tree reconstruction method? We will see that for the simple scenario described, the
answers to these questions are (i) $p^2$ and (ii) yes (up to a constant factor).



3. Lower bounds 

The main result of this section is the following: 

Theorem 3.1. Suppose $k$ sites evolve i.i.d. under a symmetric two-state model on
some (unknown) four-taxon tree that has branch length $x$ on the interior edge and
$px$ on each terminal edge. Then any method that is able to correctly identify the
underlying tree topology with probability at least $1 - \epsilon$ requires:

$k \geq c_\epsilon \cdot p^2$

for any $x$, where $c_\epsilon = \frac{1}{2}\left(1 - \frac{3}{2}\epsilon\right)^2$.

To establish this result we require some preliminary results. We begin with a 
general information-theoretic bound on the number of i.i.d. observations required 
to reconstruct a discrete parameter in a general setting. 

Suppose one has a finite set $A$, and each element $a \in A$ has an associated
probability distribution on a finite set $U$. Suppose we observe $k$ observations from
$U$ that are generated independently by the same unknown element $a \in A$. Suppose,
furthermore, that some method $M$ estimates the element of $A$ that generated our
observations and does so correctly with probability at least $1 - \epsilon$ (regardless of
which element $a$ actually generated the data). Then we can set a lower bound on $k$
in terms of a stochastic distance between elements of $A$. Recall that the Hellinger
distance of two elements $a, a' \in A$ is defined as follows. If $p$ and $q$ denote the
probability distributions induced by $a$ and $a'$ respectively, then let:

(1) $d_H^2(a, a') := \sum_{u \in U} \left(\sqrt{p_u} - \sqrt{q_u}\right)^2 = 2\left(1 - \sum_{u \in U} \sqrt{p_u q_u}\right).$

The latter equality holds as $\sum_u p_u = \sum_u q_u = 1$. The following result is from [13]
(Theorem 3.1 and (2.7)).
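
The two forms of the squared Hellinger distance in Eqn. (1) can be compared numerically. The sketch below is a minimal illustration; the function names and the two example distributions are made up for this purpose.

```python
import math

def hellinger_squared(p, q):
    """Squared Hellinger distance: sum over u of (sqrt(p_u) - sqrt(q_u))^2."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def hellinger_squared_via_affinity(p, q):
    """Equivalent form 2 * (1 - sum over u of sqrt(p_u * q_u)).

    Only valid when both p and q sum to 1.
    """
    return 2.0 * (1.0 - sum(math.sqrt(a * b) for a, b in zip(p, q)))

# Two arbitrary probability distributions on a three-element set U.
p_dist = [0.5, 0.3, 0.2]
q_dist = [0.4, 0.4, 0.2]
direct = hellinger_squared(p_dist, q_dist)
affinity_form = hellinger_squared_via_affinity(p_dist, q_dist)
print(direct, affinity_form)
```

Both expressions agree up to rounding, and the distance of a distribution from itself is zero.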

Lemma 3.2. If there is a subset $A'$ of $A$ of size $m \geq 2$ for which $d_H(a, a') \leq d$
for all $a, a' \in A'$ and some method $M$ correctly identifies each element of $A'$ with
probability at least $1 - \epsilon$ from $k$ independently-generated elements in some set $U$,
then:

$k \geq \frac{1}{4}\left(1 - \frac{m}{m-1}\epsilon\right)^2 d^{-2}.$

In our setting, $A$ will consist of the three binary four-taxon trees on leaf set
$\{1, 2, 3, 4\}$, $U$ will consist of the assignments of states to the elements of this leaf set,
and $m$ will be 3 (in this section) or 2 (in Section 5).

Let $S$ be the set of possible binary site patterns on $\{1, 2, 3, 4\}$. These consist of
the site patterns $s_1 := \alpha\alpha\beta\beta$, $s_2 := \alpha\beta\alpha\beta$ and $s_3 := \alpha\beta\beta\alpha$, and five non-informative
ones $s_4, \ldots, s_8$ (note that pairs of complementary site patterns, for example $\alpha\alpha\beta\beta$
and $\beta\beta\alpha\alpha$, are regarded as equivalent). For any site pattern $s \in S$, let $p_s = P(s \mid T_1)$
(respectively $q_s = P(s \mid T_2)$) be the probability that the site pattern $s$ is generated
on $T_1$ (respectively $T_2$). We can express the probabilities $p_{s_1}$ and $p_{s_2}$ in terms of
$\theta = e^{-2x}$ by using the Hadamard representation of [1] (see [13], Section 8.6). We
have:

(2) $p_{s_1} = \frac{1}{8}\left(1 + 2\theta^{2p} - 4\theta^{2p+1} + \theta^{4p}\right)$

and:

(3) $p_{s_2} = \frac{1}{8}\left(1 - 2\theta^{2p} + \theta^{4p}\right) = \frac{1}{8}\left(1 - \theta^{2p}\right)^2.$
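
The closed forms for $p_{s_1}$ and $p_{s_2}$ can be cross-checked by brute force. The sketch below sums over the states at the two interior vertices of the tree 12|34, assuming the usual form of the symmetric two-state model (net change probability $\frac{1}{2}(1 - e^{-2l})$ on a branch of length $l$, and a uniform state at an interior vertex). All function names and the test values of $x$ and $p$ are illustrative.

```python
import itertools
import math

def pattern_probability(pattern, x, p):
    """Probability of a leaf pattern (tuple of 0/1 for taxa 1..4) on tree 12|34."""
    stay_terminal = 0.5 * (1.0 + math.exp(-2.0 * p * x))  # no net change, terminal edge
    stay_interior = 0.5 * (1.0 + math.exp(-2.0 * x))      # no net change, interior edge
    total = 0.0
    # u is the interior vertex adjacent to taxa 1, 2; v is adjacent to taxa 3, 4.
    for u, v in itertools.product((0, 1), repeat=2):
        prob = 0.5 * (stay_interior if u == v else 1.0 - stay_interior)
        for leaf_state, parent in zip(pattern, (u, u, v, v)):
            prob *= stay_terminal if leaf_state == parent else 1.0 - stay_terminal
        total += prob
    return total

def class_probability(pattern, x, p):
    """Probability of a pattern together with its complement (treated as equivalent)."""
    complement = tuple(1 - s for s in pattern)
    return pattern_probability(pattern, x, p) + pattern_probability(complement, x, p)

x, p = 0.1, 3.0                      # illustrative values only
theta = math.exp(-2.0 * x)
p_s1 = class_probability((0, 0, 1, 1), x, p)
p_s2 = class_probability((0, 1, 0, 1), x, p)
closed_s1 = (1 + 2 * theta ** (2 * p) - 4 * theta ** (2 * p + 1) + theta ** (4 * p)) / 8
closed_s2 = (1 - theta ** (2 * p)) ** 2 / 8
print(p_s1, closed_s1, p_s2, closed_s2)
```

The enumerated and closed-form values coincide, and their difference $p_{s_1} - p_{s_2}$ equals $\frac{1}{2}\theta^{2p}(1 - \theta)$.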

To obtain an upper bound on the Hellinger distance for our problem, we require a 
further technical lemma. 

Lemma 3.3. Let $\gamma > 1$ and let $h(x) = \frac{x^\gamma (1 - x)}{1 - x^\gamma}$. Then the supremum of $h(x)$ for $x$
in the half-open interval $[0, 1)$ equals $\frac{1}{\gamma}$.

Proof. Since $\gamma > 1$ it can be checked that $h'(x) > 0$ for all $x$ in $(0, 1)$, and so
$\sup_{x \in [0,1)} h(x) = \lim_{x \to 1} h(x)$. By L'Hôpital's rule, we have $\lim_{x \to 1} h(x) = \frac{1}{\gamma}$. □
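
A numerical sanity check of this lemma is straightforward. The sketch below evaluates $h$ on a grid in $[0, 1)$ for one arbitrarily chosen $\gamma$; it is an illustration only and does not replace the proof.

```python
def h(x, gamma):
    """h(x) = x^gamma * (1 - x) / (1 - x^gamma) on the half-open interval [0, 1)."""
    return x ** gamma * (1.0 - x) / (1.0 - x ** gamma)

gamma = 6.0                                   # arbitrary illustrative choice, gamma > 1
grid = [i / 10000 for i in range(10000)]      # points 0, 0.0001, ..., 0.9999
values = [h(x, gamma) for x in grid]

# h should be non-decreasing on the grid and stay below 1 / gamma.
is_increasing = all(a <= b + 1e-15 for a, b in zip(values, values[1:]))
largest = max(values)
print(is_increasing, largest, 1.0 / gamma)
```

On this grid the values increase towards, but never reach, $1/\gamma$.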



Proof of Theorem 3.1.

If any method has a probability of at least $1 - \epsilon$ of correctly reconstructing each
of the three binary trees on four taxa from i.i.d. sequences of length $k$ then, by
Lemma 3.2 with $m = 3$ we have:

(4) $k \geq \frac{1}{4}\left(1 - \frac{3}{2}\epsilon\right)^2 \cdot d_H^{-2}$

where $d_H$ is the maximum Hellinger distance between any two of the three trees.
Now, if each of the three trees has the $x, px$ combination of branch lengths (for
interior, terminal branches, respectively) then, by symmetry, all three of these
pairwise Hellinger distances are equal. Moreover, we claim that:

(5) $d_H^2 \leq \frac{1}{2p^2},$

which together with (4) requires $k \geq c_\epsilon p^2$ for the choice of $c_\epsilon$ described. Thus it
remains to establish (5).

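Without loss of generality, $T_1 = 12|34$ and $T_2 = 13|24$. Now, for all $i = 3, \ldots, 8$,
we have $p_{s_i} = q_{s_i}$. Furthermore, $p_{s_1} = q_{s_2}$ and $p_{s_2} = q_{s_1}$ as the given trees are
identical except for their leaf labelling. Consequently, Eqn. (1) can be simplified
as follows:

(6) $d_H^2(T_1, T_2) = 2\left(1 - \sum_{i=1}^{8} \sqrt{p_{s_i} q_{s_i}}\right) = 2\left(1 - \sum_{i=3}^{8} p_{s_i} - 2\sqrt{p_{s_1} p_{s_2}}\right)$

(7) $= 2\left(1 - (1 - p_{s_1} - p_{s_2}) - 2\sqrt{p_{s_1} p_{s_2}}\right)$

(8) $= 2\left(p_{s_1} + p_{s_2} - 2\sqrt{p_{s_1} p_{s_2}}\right).$

Let $\delta = \frac{1}{2}\theta^{2p}(1 - \theta)$. Then $p_{s_1} = p_{s_2} + \delta$, and so Eqn. (8) can be re-written as:

(9) $d_H^2(T_1, T_2) = 4 p_{s_2} \cdot \left(1 + \frac{\delta}{2 p_{s_2}} - \sqrt{1 + \frac{\delta}{p_{s_2}}}\right).$

Applying the inequality $\sqrt{1 + y} \geq 1 + \frac{y}{2} - \frac{y^2}{4}$ for any $y \geq 0$, to $y = \frac{\delta}{p_{s_2}}$
gives:

$d_H^2(T_1, T_2) \leq \frac{\delta^2}{p_{s_2}} = 2\left(\frac{\theta^{2p}(1 - \theta)}{1 - \theta^{2p}}\right)^2 \leq \frac{1}{2p^2},$

where the last inequality follows by invoking Lemma 3.3 with $\gamma = 2p$, $x = \theta$. This
establishes (5) and thereby completes the proof of the theorem.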
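
The bound (5) and the resulting sequence length requirement can be explored numerically. The sketch below evaluates the exact squared Hellinger distance from Eqn. (8), using the closed forms (2) and (3), over a grid of $\theta$ values, and compares the largest value found with $1/(2p^2)$. The function names, the grid and the chosen $p$ and $\epsilon$ are illustrative assumptions; the constant used is $c_\epsilon = \frac{1}{2}(1 - \frac{3}{2}\epsilon)^2$, as obtained by combining (4) and (5).

```python
import math

def hellinger_squared_between_topologies(theta, p):
    """Exact d_H^2(T1, T2) = 2 (sqrt(p_s1) - sqrt(p_s2))^2, an equivalent form of Eqn. (8)."""
    p_s1 = (1 + 2 * theta ** (2 * p) - 4 * theta ** (2 * p + 1) + theta ** (4 * p)) / 8
    p_s2 = (1 - theta ** (2 * p)) ** 2 / 8
    return 2.0 * (math.sqrt(p_s1) - math.sqrt(p_s2)) ** 2

def sequence_length_lower_bound(eps, p):
    """c_eps * p^2 with c_eps = (1/2) * (1 - 1.5 * eps)^2."""
    return 0.5 * (1.0 - 1.5 * eps) ** 2 * p ** 2

p = 5.0                                        # illustrative ratio of branch lengths
thetas = [i / 1000 for i in range(1, 1000)]    # theta = e^{-2x} ranges over (0, 1)
worst = max(hellinger_squared_between_topologies(t, p) for t in thetas)
print(worst, 1.0 / (2.0 * p ** 2))
print(sequence_length_lower_bound(0.1, p))
```

Whatever interior branch length is chosen, the distance stays below $1/(2p^2)$, which is why the lower bound on $k$ does not depend on $x$.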



4. An Upper bound: The Performance of Maximum Parsimony 



We now show that the lower bound described above is essentially 'best possible'
(up to a constant factor) for the given model, as it can be achieved for a certain
choice of $x$ by a simple tree reconstruction method, namely Maximum Parsimony
(MP). This method selects the tree that requires the smallest number of substitutions
to extend the sequences at the tips of the tree to (ancestral) sequences at all
the interior vertices of the tree (for further background, the reader can consult, for
example, [3] or [13]).

The probability that MP correctly reconstructs the true tree 12|34 will be called
the MP reconstruction probability. In the following theorem, and subsequently, the
notation $c \sim_p C$ indicates that $c/C$ converges to 1 as $p$ grows. Let $f(\epsilon)$ denote the
one-sided $\epsilon$-critical value for the standard normal distribution, defined by:

$f(\epsilon) = z \iff \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \, dt = \epsilon.$

Theorem 4.1. Suppose $k$ sites evolve i.i.d. under a symmetric two-state model on
some (unknown) four-taxon tree that has branch length $x$ on the interior edge and
$px$ on each terminal edge. If $k \geq c' p^2 f(\epsilon/2)^2$, where $c' \sim_p 4e^2$, an interior branch
length $x$ exists for which the MP reconstruction probability is at least $1 - \epsilon$.



In order to prove this theorem, some preliminary work is required. Suppose we
generate a sequence $C$ of $k$ i.i.d. sites under the symmetric two-state model. Define
the random variables $X_i$ and $Y_k$ as follows. Let:

$X_i = \begin{cases} 1, & \text{if the } i\text{th character in } C \text{ is of the kind } (\alpha, \alpha, \beta, \beta); \\ -1, & \text{if the } i\text{th character in } C \text{ is of the kind } (\alpha, \beta, \alpha, \beta); \\ 0, & \text{else,} \end{cases}$

and let:

$Y_k = \sum_{i=1}^{k} X_i.$

The probability that MP will favour the tree 12|34 over 13|24 is then $P(Y_k > 0)$.
We will exploit the fact that the random variables $X_i$ are i.i.d., and so $Y_k$ can be
approximated for large $k$ by a normal distribution with a mean $\mu_k$ and a standard
deviation $\sigma_k$. These two parameters can be easily described (just) in terms of $\theta$, $p$
and $k$ as follows.

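Lemma 4.2.

(1) $\mu_k = k \cdot \frac{1}{2}\theta^{2p}(1 - \theta)$.

(2) $\sigma_k^2 = k \cdot \frac{1}{4}\left(1 + 2\theta^{4p+1} - 2\theta^{2p+1} - \theta^{4p+2}\right)$.

(3) $\frac{\mu_k}{\sigma_k} \geq \sqrt{k} \cdot \theta^{2p}(1 - \theta)$.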



Proof. Since $X_1, \ldots, X_k$ are independent and take values $+1$, $0$ and $-1$, we have:

(i) $\mu_k = k \cdot \left[P(X_1 = 1) - P(X_1 = -1)\right]$

(ii) $\sigma_k^2 = k \cdot \left[P(X_1 = 1) + P(X_1 = -1) - \left(P(X_1 = 1) - P(X_1 = -1)\right)^2\right]$

Now in the two-state symmetric model and the generating tree in Fig. 1(a), we
have:

$P(X_1 = 1) = p_{s_1}$, and $P(X_1 = -1) = p_{s_2}$,

where $p_{s_1}, p_{s_2}$ were given above in Eqns. (2) and (3), respectively. Parts (1) and
(2) of the lemma now follow by substitution of the expressions for $p_{s_1}, p_{s_2}$ into (i)
and (ii) respectively. For Part (3), note that Parts (1) and (2) imply that

(10) $\frac{\mu_k}{\sigma_k} = \sqrt{k} \cdot \frac{N_\theta}{D_\theta},$

where $N_\theta = \theta^{2p}(1 - \theta)$ and $D_\theta = \sqrt{1 + 2\theta^{4p+1} - 2\theta^{2p+1} - \theta^{4p+2}}$. We now show that
$D_\theta \leq 1$. We have $1 + 0.5\theta^{2p+1} \geq \theta^{2p}$ and so $2\theta^{2p+1}(1 - \theta^{2p} + 0.5\theta^{2p+1}) \geq 0$.
Consequently $1 - 2\theta^{2p+1}(1 - \theta^{2p} + 0.5\theta^{2p+1}) \leq 1$, which implies that $D_\theta^2 \leq 1$. Part
(3) now follows from (10) by the inequality $D_\theta \leq 1$. □
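
The three parts of Lemma 4.2 can be checked numerically from the pattern probabilities in Eqns. (2) and (3). The sketch below is an illustration with arbitrarily chosen $\theta$, $p$ and $k$; the function and variable names are hypothetical.

```python
import math

def mp_score_moments(theta, p, k):
    """Mean and variance of Y_k, computed from P(X_1 = 1) = p_s1 and P(X_1 = -1) = p_s2."""
    p_s1 = (1 + 2 * theta ** (2 * p) - 4 * theta ** (2 * p + 1) + theta ** (4 * p)) / 8
    p_s2 = (1 - theta ** (2 * p)) ** 2 / 8
    mean = k * (p_s1 - p_s2)
    variance = k * (p_s1 + p_s2 - (p_s1 - p_s2) ** 2)
    return mean, variance

theta, p, k = 0.9, 4.0, 1000        # illustrative values only
mean, variance = mp_score_moments(theta, p, k)

# Closed forms from parts (1) and (2) of the lemma.
closed_mean = k * 0.5 * theta ** (2 * p) * (1 - theta)
closed_variance = k * 0.25 * (
    1 + 2 * theta ** (4 * p + 1) - 2 * theta ** (2 * p + 1) - theta ** (4 * p + 2)
)

# Part (3): mu_k / sigma_k is at least sqrt(k) * theta^(2p) * (1 - theta).
signal_to_noise = mean / math.sqrt(variance)
part3_lower_bound = math.sqrt(k) * theta ** (2 * p) * (1 - theta)
print(mean, closed_mean, variance, closed_variance, signal_to_noise, part3_lower_bound)
```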

Proof of Theorem 4.1. Note that the MP reconstruction probability is the probability
that MP will favour the true tree 12|34 over both alternative trees on four
taxa, namely 13|24 and 14|23. Recall that the probability of the tree 12|34 being favoured
over 13|24 is $P(Y_k > 0)$. The event of 12|34 being favoured over
14|23 can be expressed similarly by defining the random variables $\tilde{X}_i$ and $\tilde{Y}_k$, which
are analogous to $X_i$ and $Y_k$, using the character $(\alpha, \beta, \beta, \alpha)$ instead of $(\alpha, \beta, \alpha, \beta)$.
Then, the MP reconstruction probability can be written as $P\left((Y_k > 0) \cap (\tilde{Y}_k > 0)\right)$.
Let:

$Z_k = \frac{Y_k - \mu_k}{\sigma_k}.$

Thus, $Z_k$ is the normalised difference of the parsimony score between tree 13|24 and
12|34 for $k$ i.i.d. characters generated by the tree in Fig. 1(a). By Lemma 4.2(3)
we have

(11) $P(Y_k \leq 0) = P\left(Z_k \leq -\frac{\mu_k}{\sigma_k}\right) \leq P\left(Z_k \leq -\sqrt{k}\,\theta^{2p}(1 - \theta)\right).$

Now, by symmetry of the branch lengths of the generating tree in Fig. 1(a), we have
$P(Y_k \leq 0) = P(\tilde{Y}_k \leq 0)$. Moreover, by Boole's inequality:

$P\left((Y_k > 0) \cap (\tilde{Y}_k > 0)\right) \geq 1 - P(Y_k \leq 0) - P(\tilde{Y}_k \leq 0),$

which, combined with (11), furnishes the following inequality for the MP reconstruction
probability:

(12) $P\left((Y_k > 0) \cap (\tilde{Y}_k > 0)\right) \geq 1 - 2P(Y_k \leq 0) \geq 1 - 2P\left(Z_k \leq -\sqrt{k}\,\theta^{2p}(1 - \theta)\right).$

Now, $\theta^{2p} \cdot (1 - \theta)$ has a unique local maximum in $[0, 1]$, namely at $\theta' := 1 - \frac{1}{2p+1}$, at
which it takes the value $a_p/p$, where $a_p = \left(1 - \frac{1}{1+2p}\right)^{2p} \cdot \frac{p}{1+2p} \sim_p \frac{1}{2}e^{-1}$. Moreover,
the difference between the distribution of $Z_k$ and a standard normal distribution
tends uniformly to zero as $p$ (and hence $k$) grows. This follows by applying standard
bounds on the central limit theorem approximation (see, for example, [19]; one
cannot directly apply the usual form of the central limit theorem as the distribution
of the $X_i$'s is changing with increasing $p$). Thus we have $P\left(Z_k \leq -\sqrt{k}\,\frac{a_p}{p}\right) \leq \epsilon/2$
provided that $k$ grows at the rate $c' p^2 f(\epsilon/2)^2$ for $c' \sim_p 4e^2$.

In summary, by (12), a value for $\theta$ exists, namely $\theta' = 1 - \frac{1}{1+2p}$, and thus a value
for $P(e_5) = \frac{1}{2}(1 - \theta') = \frac{1}{2(1+2p)} \sim_p \frac{1}{4p}$ also exists, for which the MP reconstruction
probability is at least $1 - \epsilon$. This completes the proof. □
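
The quantities appearing in this proof are easy to evaluate. The following sketch computes the maximising value $\theta'$, the constant $a_p$, and the sequence length suggested by the normal approximation, namely the smallest $k$ with $\sqrt{k}\, a_p / p \geq |f(\epsilon/2)|$. It assumes the normal approximation is exact, which the proof only justifies asymptotically; the function names and sample values are illustrative.

```python
import math
from statistics import NormalDist

def optimal_theta(p):
    """Maximiser theta' = 1 - 1/(2p + 1) of theta^(2p) * (1 - theta) on [0, 1]."""
    return 1.0 - 1.0 / (2.0 * p + 1.0)

def a_p(p):
    """a_p = (1 - 1/(1 + 2p))^(2p) * p / (1 + 2p), which tends to 1/(2e) as p grows."""
    return (1.0 - 1.0 / (1.0 + 2.0 * p)) ** (2.0 * p) * p / (1.0 + 2.0 * p)

def mp_sequence_length(eps, p):
    """Smallest integer k with sqrt(k) * a_p / p >= |f(eps/2)| (normal approximation)."""
    z = NormalDist().inv_cdf(eps / 2.0)   # f(eps/2); negative when eps < 1
    return math.ceil((p * z / a_p(p)) ** 2)

p, eps = 20.0, 0.05                       # illustrative values only
theta_opt = optimal_theta(p)
x_opt = -0.5 * math.log(theta_opt)        # interior branch length with e^{-2x} = theta'
k_needed = mp_sequence_length(eps, p)
print(theta_opt, x_opt, 1.0 / (4.0 * p), k_needed)
```

For moderately large $p$, the interior branch length recovered from $\theta'$ is close to $\frac{1}{4p}$, and $1/a_p^2$ is close to $4e^2$.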



4.1. Remarks. 



• Regarding Theorem 4.1, other tree reconstruction methods have a similar
performance to MP when $k$ grows at the rate $p^2$. Indeed it is possible that
such methods will require shorter sequences, and have better statistical properties
on trees with different tree shapes (as MP is statistically inconsistent
under some combinations of branch lengths that lie outside those considered
in the scenario of Fig. 1). We have chosen to consider MP here, because the
analysis is relatively straightforward and it suffices to show that the
$p^2$ lower bound can be matched.

• One can also derive a (non-asymptotic) form of Theorem 4.1 using Azuma's
inequality pQ; however, the constant term in place of $c_\epsilon$ is larger by a factor
of 32.

• The optimal choice of $x$ of (approximately) $\frac{1}{4p}$ for MP has been observed
in a slightly different setting by [15].

• One can ask whether similar $p^2$ bounds on $k$ will apply for more complex
models. We conjecture that for stationary, reversible, finite-state Markov
processes, the results will be essentially the same for our tree in Fig. 1, up
to a different constant factor $c$.

• For Markov processes in which the state space is countably infinite, and
where a substitution is always to a new state (the 'random cluster model'
for homoplasy-free evolution, described in [7]), the situation regarding
sequence length requirements is quite different. In this case, the required
sequence length need only grow at the rate $p$ (not $p^2$), as the following
result shows.

Proposition 4.3. Suppose $k$ sites evolve i.i.d. under a random cluster
model on some (unknown) four-taxon tree that has branch length $x$
on the interior edge and $px$ on each terminal edge. Then for a constant $c'_\epsilon$
which depends just on $\epsilon$, the following holds: If $k \geq c'_\epsilon \cdot p$, an $x$ exists for
which the MP reconstruction probability is at least $1 - \epsilon$.

Proof. In the random cluster model, the probability of a substitution event
on an edge $e$ can be written as $P(e) = 1 - \exp(-l)$ where $l$ is the expected
number of changes on the edge (the branch length). Now, the random
cluster model only generates characters that are homoplasy-free on the
generating tree; thus MP will return the generating tree from a sequence of
characters, provided this tree is the only one on which those characters are
homoplasy-free. For a tree with topology 12|34, this will occur precisely if
at least one of the $k$ characters generated assigns taxa 1, 2 a shared state,
and taxa 3, 4 a second shared state that is different to that assigned to 1, 2.
The probability $Q$ that any given character generated by the tree in Fig.
1(a) has this property is given by:
$Q = P(e_5) \prod_{i=1}^{4} (1 - P(e_i)) = (1 - e^{-x})\, e^{-4px}.$

Moreover, if $k \geq \log(\frac{1}{\epsilon})/Q$ then $1 - (1 - Q)^k \geq 1 - \epsilon$ (using the inequality
$-\log(1 - Q) \geq Q$). Consequently, MP will correctly reconstruct the
generating tree with probability at least $1 - \epsilon$ provided that:

(13) $k \geq \log(\epsilon^{-1}) \cdot (1 - e^{-x})^{-1} e^{4px}.$

Taking $x = \frac{1}{4p}$ we have $(1 - e^{-x})^{-1} e^{4px} \sim_p 4ep$, which,
in view of (13), establishes the result. □
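
The linear growth in $p$ can be seen numerically. The sketch below evaluates $Q$ directly from the product $P(e_5) \prod_{i=1}^{4} (1 - P(e_i))$ with $P(e) = 1 - \exp(-l)$, and then the sufficient sequence length $\log(1/\epsilon)/Q$ at $x = \frac{1}{4p}$. Function names and sample values are illustrative assumptions.

```python
import math

def cut_probability(length):
    """Probability of at least one substitution on an edge: 1 - exp(-length)."""
    return 1.0 - math.exp(-length)

def resolving_character_probability(x, p):
    """Q = P(e_5) * prod_{i=1..4} (1 - P(e_i)): a substitution on the interior
    edge (length x) and none on the four terminal edges (length p * x each)."""
    return cut_probability(x) * (1.0 - cut_probability(p * x)) ** 4

def sufficient_sequence_length(eps, x, p):
    """Smallest integer k with k >= log(1 / eps) / Q."""
    return math.ceil(math.log(1.0 / eps) / resolving_character_probability(x, p))

eps = 0.05
for p in (10.0, 100.0, 1000.0):
    x = 1.0 / (4.0 * p)
    k = sufficient_sequence_length(eps, x, p)
    print(p, k, k / p)   # k / p settles near a constant as p grows
```

With this choice of $x$ the ratio $k/p$ approaches $4e \log(1/\epsilon)$, so the required sequence length grows only linearly in $p$.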
5. Lower bounds for more general models



In this section we derive a lower bound on the sequence length required for tree 
reconstruction, for a much wider range of Markov processes. However, unlike the 
previous sections our bound is expressed in terms of the absolute branch lengths 
(or bounds on these) rather than in terms of ratios, and it involves constants that 
depend on the details of the model. 

We first derive a general lemma. Consider any continuous-time, stationary and
reversible Markov process. Let $\mathcal{S}$ denote its state space, and in keeping with earlier
terminology let $S = \mathcal{S}^4$ (thus in previous sections $\mathcal{S} = \{\alpha, \beta\}$). Let $T_1$ and $T_2$ be two
topologically distinct four-taxon trees. Suppose that the branch lengths of $T_1$ are
arbitrary, and that each edge of $T_2$ has the corresponding interior or pendant branch
length specified by $T_1$ (where the pendant edge incident with leaf $i$ in $T_1$ corresponds
to the pendant edge incident with leaf $i$ in $T_2$). For $s = (s_1, s_2, s_3, s_4) \in S$, let $p_s$
(respectively $q_s$) denote the probability of generating $s$ at the tips of $T_1$ (respectively
$T_2$). Let $p'_s$ (respectively $q'_s$) denote the conditional probability of generating $s$ at
the tips of $T_1$ (respectively $T_2$) given that a substitution has occurred on the central
edge of $T_1$ (respectively $T_2$), and let $D_s := q'_s - p'_s$. Then we have the following
result.

Lemma 5.1.

$d_H^2(T_1, T_2) \leq l^2 \cdot \sum_{s \in S} \frac{D_s^2}{p_s},$

where $l$ denotes the branch length of the interior edge of $T_1$.



Proof. Let $\tau$ denote the probability that at least one substitution occurs on the
interior edge of $T_1$, and let $p^0_s$ (respectively $q^0_s$) denote the conditional probability
of generating $s$ on $T_1$ (respectively $T_2$) given that no substitution occurs on the
interior edge of $T_1$ (respectively $T_2$). By the law of total probability we have:

$p_s = (1 - \tau) \cdot p^0_s + \tau \cdot p'_s$

and

$q_s = (1 - \tau) \cdot q^0_s + \tau \cdot q'_s.$

Moreover, the assumptions on the correspondence between branch lengths of $T_1$
and $T_2$ imply that $p^0_s = q^0_s$ for all $s \in S$ and so:

$q_s - p_s = \tau(q'_s - p'_s) = \tau D_s.$

Now,

$d_H^2(T_1, T_2) = 2\left(1 - \sum_{s \in S} \sqrt{p_s q_s}\right) = 2\left(1 - \sum_{s \in S} p_s \sqrt{1 + \frac{\tau D_s}{p_s}}\right).$

Applying the inequality $\sqrt{1 + y} \geq 1 + \frac{y}{2} - \frac{y^2}{2}$ (for all $y \geq -1$) to $y = \frac{\tau D_s}{p_s}$ (and
observing that $y \geq -1$ since $q_s \geq 0$), we obtain:

$d_H^2(T_1, T_2) \leq 2\left(1 - \sum_{s \in S} p_s\left(1 + \frac{\tau D_s}{2 p_s} - \frac{\tau^2 D_s^2}{2 p_s^2}\right)\right).$

Now, $\sum_s p_s = 1$, and $\sum_s D_s = 0$ (since $\sum_s q'_s = \sum_s p'_s = 1$) and so this last
inequality reduces to:

(14) $d_H^2(T_1, T_2) \leq \tau^2 \cdot \sum_{s \in S} \frac{D_s^2}{p_s}.$

Furthermore, $\tau = P(N > 0)$, where $N$ is the number of substitutions occurring on
the interior edge of $T_1$. However, $P(N > 0) \leq E(N)$; that is, $\tau \leq l$, which, together
with (14), provides the inequality stated in the lemma. □
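
The lemma can be illustrated numerically for the symmetric two-state model with unequal terminal branch lengths. The sketch below makes two modelling assumptions that are not spelled out in the text: substitutions on an edge of length $l$ form a Poisson process with mean $l$ (so $\tau = 1 - e^{-l}$), and a net state change corresponds to an odd number of substitutions (probability $\frac{1}{2}(1 - e^{-2l})$). The function names and branch lengths are illustrative.

```python
import itertools
import math

def pattern_distribution(topology, interior_change, terminal_lengths):
    """Distribution over the 16 leaf patterns under a symmetric two-state model.

    topology: pair of leaf-index pairs, e.g. ((0, 1), (2, 3)) for 12|34.
    interior_change: probability that the two interior vertices differ in state.
    """
    change = [0.5 * (1.0 - math.exp(-2.0 * l)) for l in terminal_lengths]
    left, _right = topology
    dist = {}
    for pattern in itertools.product((0, 1), repeat=4):
        total = 0.0
        for u, v in itertools.product((0, 1), repeat=2):
            prob = 0.5 * (interior_change if u != v else 1.0 - interior_change)
            for leaf in range(4):
                parent = u if leaf in left else v
                prob *= change[leaf] if pattern[leaf] != parent else 1.0 - change[leaf]
            total += prob
        dist[pattern] = total
    return dist

l = 0.05                                   # interior branch length (illustrative)
lengths = [0.7, 0.4, 0.9, 0.5]             # terminal branch lengths (illustrative)
T1, T2 = ((0, 1), (2, 3)), ((0, 2), (1, 3))

net_change = 0.5 * (1.0 - math.exp(-2.0 * l))   # interior endpoints differ
tau = 1.0 - math.exp(-l)                        # at least one substitution (Poisson assumption)
cond_change = net_change / tau                  # endpoints differ given a substitution occurred

p_dist = pattern_distribution(T1, net_change, lengths)
q_dist = pattern_distribution(T2, net_change, lengths)
p_cond = pattern_distribution(T1, cond_change, lengths)
q_cond = pattern_distribution(T2, cond_change, lengths)

d2 = 2.0 * (1.0 - sum(math.sqrt(p_dist[s] * q_dist[s]) for s in p_dist))
bound = l ** 2 * sum((q_cond[s] - p_cond[s]) ** 2 / p_dist[s] for s in p_dist)
print(d2, bound)
```

In this example the exact squared Hellinger distance lies below the bound $l^2 \sum_s D_s^2 / p_s$, as the lemma requires.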



We now apply this lemma to a slightly more restricted class of Markov processes 
to obtain the main result of this section. 

Theorem 5.2. Suppose $k$ sites evolve i.i.d. under a finite-state, stationary and
reversible continuous-time Markov process in which each state is accessible from
any other state. Let $l_0$ be any strictly positive value. Consider this process on
some (unknown) four-taxon tree that has branch length at most $l$ on the interior
edge and at least $L \geq l_0$ on each terminal edge. Then any method that is able to
correctly identify the underlying tree topology with probability at least $1 - \epsilon$, given
these restrictions, requires:

$k \geq \frac{1}{4}(1 - 2\epsilon)^2 \cdot \frac{c\, e^{CL}}{l^2},$

where $c$ and $C$ are positive constants that depend only on $R$ (the rate matrix for
the process) and $l_0$.



Proof. We exploit the fact that any Markov process of the type described converges
to its unique stationary distribution at an exponential rate (see, for example, Theorem
8.3 of [11]). Let $\pi(s)$ denote the stationary probability of $s$ under the model.
For $j = 1, \ldots, 4$, let $\rho(j) \in \{u, v\}$ be the end of the interior edge $uv$ of $T_1$ that is
adjacent to leaf $j$ (we may assume $\rho(1) = \rho(2) = u$; $\rho(3) = \rho(4) = v$), and let $S_{\rho(j)}$
denote the random state present at that vertex under the model. Then for any
$s_j, s'_j \in \mathcal{S}$ there exist positive constants $A, a$ (dependent on $R$) for which:

(15) $\left|P(S_j = s_j \mid S_{\rho(j)} = s'_j) - \pi(s_j)\right| \leq A e^{-a L_j}$

([11], Theorem 8.3), where $S_j$ denotes the random state at leaf $j$ and $L_j$ denotes the
branch length of the edge incident with leaf $j$. For $s = (s_1, s_2, s_3, s_4) \in S = \mathcal{S}^4$, let

$\pi_s := \prod_{j=1}^{4} \pi(s_j).$

For $s', s'' \in \mathcal{S}$ let $p'(s', s'')$ denote the probability of generating state $s'$ at $u$ and
the state $s''$ at $v$ given that at least one substitution occurs on the edge $uv$. Then,
by the Markov assumption, and recalling the definition of $p'_s$ from Lemma 5.1, we
have:

(16) $p'_s = \sum_{(s', s'') \in \mathcal{S}^2} p'(s', s'') \cdot \prod_{j=1}^{2} P(S_j = s_j \mid S_{\rho(j)} = s') \cdot \prod_{j=3}^{4} P(S_j = s_j \mid S_{\rho(j)} = s'').$

Combining (15) and (16), there exist positive constants $B, b$ (dependent only on $R$)
such that:

(17) $|p'_s - \pi_s| \leq B e^{-bL}$

for all $s \in S$ (recall that $L \leq L_j$ for all $j$). Now, consider tree $T_2$ which has branch
lengths that correspond to those in $T_1$ (as in Lemma 5.1). Then we also have:

(18) $|q'_s - \pi_s| \leq B e^{-bL}$

for all $s \in S$. Combining (17) and (18) using the triangle inequality gives:

(19) $|D_s| = |q'_s - p'_s| \leq 2B e^{-bL}.$

Moreover, since $L_j \geq l_0$ (for all $j$) and each state is accessible from any other state,
we have $p_s \geq \delta$ (for some $\delta > 0$ dependent only on $R$ and $l_0$). Combining this with
(19) gives the following inequality, for all $s \in S$:

(20) $\frac{D_s^2}{p_s} \leq (4B^2/\delta)\, e^{-2bL}.$

The theorem now follows from Lemma 5.1 and Lemma 3.2 (with $m = 2$). □



6. Concluding remarks 



In this paper we have provided precise results for a specific and simple model
(the two-state symmetric process), along with less explicit results for more general
Markov processes (and phrased in terms of absolute rather than relative branch
lengths). The aim is to determine rigorous bounds on the sequence length required
for resolving a deep divergence, which may shed light on debates as to whether some
early radiations might be fundamentally unresolvable on the basis of current models
and data.

Of course, in applications, other phenomena (such as substitution model
mis-specification, lineage sorting, misalignment of sequences, sequencing errors
and so forth [5]) may further impede phylogenetic reconstruction; however, these
errors are unlikely to help tree reconstruction
if our bound shows it is impossible even when the ideal model assumptions hold.
We have seen that some models require significantly fewer characters for resolving 
a tree - in particular this holds for the random cluster model, and it is possible 
that new types of genomic data (involving rare genomic events where homoplasy 
is unlikely) can be described by these and related processes that preserve more 
phylogenetic signal regarding distant evolutionary divergences. 

One limitation concerning our bounds is that they apply to pure Markov processes,
in which each character evolves according to the same process. In molecular
biology a common assumption is that there is a distribution of rates across sites,
in which each site evolves at a rate (selected i.i.d. from some distribution) that
acts as a multiplier for all the branch lengths in the tree (see e.g. [13]). It would
be interesting to extend the analysis in the last section to these models to obtain a
lower bound on $k$ analogous to Theorem 5.2.